From the June Mastering CorelDRAW newsletter

OCR Comes to CorelDRAW 4

Rich Zaleski

With the arrival of version 4, CorelDRAW has made a serious move into the page layout arena. With 4's enhancements to Draw's bulk text handling and formatting capabilities, it's only natural that the program's link to scanned-in images should evolve from a tool that just converts bitmaps to editable vector files into one that can also turn scanned bitmap ‘pictures of text’ into editable text. This is done by incorporating Optical Character Recognition (OCR) functionality into the CorelTRACE program.

Trace's implementation of OCR may not be on the level of dedicated OCR programs, but it is functional and has some useful features that you might not expect to find in an add-on to what many perceive as merely an add-on utility itself. In fact, it does a remarkable job of handling this complex task, especially when you consider that many top-of-the-line, standalone OCR packages sell for more than the entire Draw 4 suite of applications.

Users with heavy OCR requirements will still find it advantageous to invest in a more robust, dedicated OCR application. But those with an occasional or limited need to convert a scanned page of text or an incoming fax into editable text, whether for use in Draw or a word processor or simply to save as a plain ASCII text file, should find Trace's OCR capabilities adequate.

The OCR Advantage

Uncompressed, a full-page, 1-bit (black-and-white) bitmap in Windows .BMP file format will occupy the better part of 500 KB of precious hard disk space. That same file can be stored in compressed TIFF format, which will cut the file size down to just over 100 KB, if the page isn't too tightly packed with text.
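
As a rough check on the uncompressed figure, the arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the US Letter page size and the 200 dpi resolution are not given in the article (they are simply one combination consistent with "the better part of 500 KB"), and `bitmap_bytes` is a hypothetical helper name. The file header and palette, a few dozen bytes, are ignored.

```python
import math


def bitmap_bytes(width_in, height_in, dpi_x, dpi_y, bits_per_pixel=1):
    """Approximate size of uncompressed .BMP pixel data, in bytes.

    Assumes each scan line is padded to a 4-byte boundary, as the
    Windows bitmap format requires. Header and palette are ignored.
    """
    width_px = round(width_in * dpi_x)
    height_px = round(height_in * dpi_y)
    row_bytes = math.ceil(width_px * bits_per_pixel / 32) * 4
    return row_bytes * height_px


# Assumed page: 8.5 x 11 inches scanned at 200 x 200 dpi, 1 bit per pixel.
size = bitmap_bytes(8.5, 11, 200, 200)
print(f"{size} bytes, about {size / 1024:.0f} KB")  # 475200 bytes, about 464 KB
```

Under those assumptions the page comes to roughly 464 KB, in line with the figure quoted above. A higher scanning resolution would give a proportionally larger file.
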
However, if like so many users today you're using disk compression software, much of the advantage usually gained in compressing graphics files is lost, because Stacker, DoubleSpace or whatever compression scheme is being used can't squeeze the file much tighter; it occupies nearly the original amount of real hard disk space. Compare such bulky file sizes to the 2 or 3 KB that the same page of text will occupy when converted to ASCII text format, and the advantage of ‘OCR-ing’ any faxes or scanned-in text files that you need to keep on hand is soon evident. And, of course, they become editable at the same time.

Happily, if you use a scanner, the huge bitmap created when scanning pages of text never needs to be stored on your hard disk. Simply use Trace's TWAIN interface to scan in the image directly by choosing Acquire Image from the File menu, then clicking on Acquire. If necessary, use Object Linking and Embedding to ‘OLE it’ into PhotoPAINT for cleaning up or deskewing by choosing Edit Image from the Edit menu. Then, in Trace, select the area of the page that you want to convert to text by clicking and dragging a marquee, and click on the OCR icon.

Memory Considerations

You should keep in mind that OCR is a memory-intensive task. For example, a full page of text requires over 10 megabytes of memory to process. Even if you've got plenty of available RAM, you may find it necessary to maintain a very large permanent swap file, avoid using Trace's OCR function while other tasks run in the background, or both. I've choked Trace with a full page of small type on a 16 MB system using a 4 MB swap file. In that case, shutting down other applications allowed the job to proceed to completion. If you're relying on a swap file to provide the needed memory, you have to be willing to accept the performance degradation that comes with virtual memory usage.
(Adjust the size of your swap file by double-clicking on the 386 Enhanced icon in the Windows Control Panel.)

If you can't have other memory-intensive apps running while you perform an OCR operation, one solution is to set up all the bitmaps that need recognition as a batch trace. Then start the batch process just before leaving the office for the day, when no other apps will be running. In any case, you should click on Modify in the Settings menu, then click on Batch Output, since it’s here that you set the default output directory and the file overwrite/make read only options for all of Trace's output.

Trace provides some controls for working with scanned text files of varying quality. Choose OCR Method by clicking on Modify in the Settings menu. The default is designed for 300 dpi bitmaps scanned from hard copy of at least laser printer quality. Settings for dot matrix and fine-quality faxes (200 by 100 dpi) can also be selected. These settings are sticky, and will remain active until you change them or select Default from the main Settings menu.

How much of a difference do these settings make? On a one-page test file generated via fax, tracing in Normal rather than Fax mode produced a text file with 42 errors. With the OCR method set to Fax, the same file converted with only a single error.

A Few Rough Spots

You'll also notice an option for Check Spelling in this dialog box. In my tests, I found this option to be virtually useless. When Draw, or your word processor, checks spelling and comes across a combination of letters that it doesn't recognize, it offers you the choice of accepting or correcting the spelling error. Trace, however, simply ignores the word and doesn't trace it. I'd rather have the output file say "The spell chec~er needs some improvement," than leave the word out entirely and give me "The spell needs some improvement."
At least in the former case the spell checker in my word processor will have something to catch. This situation is aggravated by the fact that (as far as I've been able to tell) Trace's use of the spell checker does not incorporate any user dictionary that you might have created. Proper names and specialized terms simply get dropped, rather than being flagged by having the rejected letters converted and marked with a ‘~’ or some other uncommon character. All in all, I'd strongly recommend that you give Trace's Check Spelling option a miss.

Another area where the OCR function could stand some improvement is text formatting. In short, it doesn't do any. It's not bad with straight paragraphs of text, but with columnar data or anything out of the ordinary it just treats each string of text as a line followed by a return and linespace. In the end, despite the unexpected accuracy of the character recognition, you're still likely to face some meaningful editing and reformatting time. Perhaps by the time 5.0 rolls around, we'll at least see Rich Text Format (RTF) output with some semblance of the original image's formatting. As long as we're wishing, limited font identification might be within reach as well.

The Forms Approach

Having stumbled across the weakest feature of Trace's OCR function, it's time to look at what may be its strongest capability, and is certainly its most intriguing. In addition to the standard OCR operation of converting to an ASCII text file, you have the option of using the Forms tracing method. This routine first examines the bitmap and traces any non-text elements as graphics using the outline and/or centerline method, as appropriate. It then OCRs the text, but rather than saving it as ASCII, it inserts it into the usual .EPS output file created by Trace as strings of Artistic text laid out in the positions appropriate to the image that was traced, but in the default font.
It seems to want to use a sans serif font by default: depending on which fonts are in the Ares FontMinder Font Packs I have loaded, it will be either 12.5-point Avant Garde or Arial. While it's not as fast as straight OCR tracing, this feature is particularly handy when tracing logos with accompanying text, letterheads, maps and technical illustrations.

For the accompanying illustrations, I faxed myself a blank invoice and used Trace's Forms method to convert it to .EPS. The first trace took just over three minutes on my 33 MHz 486 with 16 MB of memory, and it never required disk-based virtual memory. Since Trace does not treat white text on a black background as text, I then saved the .EPS file, cleared the .EPS window (press Delete), inverted the image (choose Modify, then Image Filtering from the Settings menu), and marquee selected the areas containing that text. After running the Forms trace on these ‘leftover’ text strings, I saved the second .EPS file under a different name.

I then imported both .EPS files into Draw and placed them side by side. After ungrouping the .EPS file created with the second trace, I changed the fonts as necessary and applied a white fill to the text before turning my attention to the other copy. There I deleted the curves that represented the white text, then used the Node Edit roll-up’s Auto-Reduce function on the larger and more complex curves that made up the form. I changed all the curve segments in the ‘table’ part of the form to lines, and performed minor cleaning up and aligning by snapping the corners to the grid. Finally, I dragged the white text that remained from the second trace on top of the form.

Total time from loading the .PCX scan of the form into Trace to printing out virtual duplicates of the original from Draw was just over half an hour. Could I have drawn and lettered the form from scratch more quickly in Draw? I doubt it.

Is it for You?

If you have heavy-duty text conversion needs, you might never use Trace's OCR capabilities, except perhaps for the occasional need to generate a text-inclusive .EPS trace of a mixed text and graphic bitmap. But then again, if your OCR needs are that intensive, you didn't buy CorelDRAW to fill them. That's why Caere and Calera are in business. But for most graphics professionals, who don't deal in lengthy text documents, Trace’s OCR capabilities should fill the bill reasonably well.

Those of you interested in trying out Trace’s OCR capabilities for yourselves can use the INV001.PCX file that was placed in the INVOICE directory of this month’s disk when you installed it. This is the scan of the form I discussed in the article.

TIP

You can also continue an OCR session that halted due to insufficient memory by closing the warning dialog box, selecting a smaller area to process, and then doing the page in two passes.

Contents Copyright Kazak Communications 1993

Subscription Information

While the regular subscription rate is $75 per year (in US dollars for Americans, Canadian dollars for Canadians), charter subscriptions to the Mastering CorelDRAW newsletter are available for a limited time at $60 (add $30 U.S. for overseas). A free sample disk, from our exclusive disk-of-the-month service (value $20), is included with your paid subscription.

To subscribe, or for more information, contact:

Chris Dickman
16 Ottawa St.
Toronto, ON M4T 2B6
Canada
416-924-0759 (voice)
416-924-4875 (fax)
CServe: 70730,2265

- 30 -